In this practical, we are going to explore statistical testing methods for metagenomic data, how to visualize the results, and how to train machine learning models using the SIAMCAT package.

1 Setup

1.1 Preparing the R environment

To get started, we first prepare our R environment and load the packages we will need later on. The data used in this practical are stored on the EMBL servers, so we also set the base path for the downloads.

library("tidyverse") # for general data wrangling and plotting
library("SIAMCAT")   # for statistical and ML analyses

data.location <- 'https://embl.de/download/zeller/metaG_course/'

1.2 Loading the Data

In this practical, we are going to have a look at the data from Zeller et al. MSB 2014. In this study, the authors recruited patients with colorectal cancer (CRC) and healthy controls (CTR) and performed shotgun metagenomic sequencing of fecal samples. The raw data have already been pre-processed and analyzed with the mOTUs2 taxonomic profiler.

1.2.1 Features

First, we are going to load the taxonomic profiles and store them as a matrix.

fn.feat.fr  <- paste0(data.location, '/motus_profiles/FR-CRC.motus')
tax.profiles <- read.table(fn.feat.fr, sep = '\t', quote = '', 
                           comment.char = '', skip = 61, 
                           stringsAsFactors = FALSE, check.names = FALSE, 
                           row.names = 1, header = TRUE)
tax.profiles <- as.matrix(tax.profiles)

The taxonomic profiles contain absolute counts and can easily be transformed into relative abundances using the prop.table function:

rel.tax.profiles <- prop.table(tax.profiles, 2)
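To see what prop.table does here, consider a small toy example (the count matrix below is made up for illustration): with margin = 2, every column is divided by its column sum, so the relative abundances within each sample add up to 1.

```r
# toy count matrix: 3 taxa (rows) x 2 samples (columns)
counts <- matrix(c(10, 30, 60,
                    5,  5, 90),
                 nrow = 3,
                 dimnames = list(paste0('taxon', 1:3), paste0('sample', 1:2)))
# margin = 2: divide each column (sample) by its column sum
rel <- prop.table(counts, 2)
rel[, 'sample1']        # 0.1 0.3 0.6
colSums(rel)            # every sample sums to 1
```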

1.2.2 Metadata

Additionally, we need to know which sample belongs to which group, so we also load the metadata table:

fn.meta.fr  <- paste0(data.location, '/metadata/meta_FR-CRC.tsv')
df.meta <- read_tsv(fn.meta.fr)
df.meta
table(df.meta$Group)
## 
## ADA CRC CTR NAA 
##  15  53  61  27

We are interested in the comparison between control samples (CTR) and colorectal cancer samples (CRC), so we first remove the other samples, which represent advanced adenoma (ADA) or non-advanced adenoma (NAA). Also, we transform the metadata into a data.frame object (which is easier for some analyses later on).

df.meta <- df.meta %>% 
  filter(Group %in% c('CRC', 'CTR')) %>% 
  as.data.frame()
rownames(df.meta) <- df.meta$Sample_ID

# restrict the taxonomic profiles to CRC and CTR samples
tax.profiles <- tax.profiles[,rownames(df.meta)]
rel.tax.profiles <- rel.tax.profiles[,rownames(df.meta)]

1.3 Feature filtering

Currently, the matrix of taxonomic profiles contains 14213 different bacterial species. Not all of them will be relevant for our question, since some are present in only a handful of samples (low prevalence) or at extremely low abundance. Therefore, it can make sense to filter the taxonomic profiles before starting the analysis. Here, we use the maximum species abundance as a filtering criterion: all species whose relative abundance exceeds 1e-03 in at least one sample are kept, the rest are filtered out.

species.max.value <- apply(rel.tax.profiles, 1, max)
f.idx <- which(species.max.value > 1e-03)
rel.tax.profiles.filt <- rel.tax.profiles[f.idx,]

Additionally, the mOTUs2 profiler can estimate how much of the relative abundance cannot be classified. We also remove this unclassified ("-1") fraction from the profiles.

# unclassified are indicated by -1 in the mOTUs2 profiles
idx.um <- which(rownames(rel.tax.profiles.filt) == '-1')
rel.tax.profiles.filt <- rel.tax.profiles.filt[-idx.um,]

2 Association Testing

Now that we have set up everything, we can test all microbial species for statistically significant differences between the groups. To do so, we perform a Wilcoxon test on each individual bacterial species.

p.vals <- rep_len(1, nrow(rel.tax.profiles.filt))
names(p.vals) <- rownames(rel.tax.profiles.filt)
stopifnot(all(rownames(df.meta) == colnames(rel.tax.profiles.filt)))
for (i in rownames(rel.tax.profiles.filt)){
  x <- rel.tax.profiles.filt[i,]
  y <- df.meta$Group
  t <- wilcox.test(x~y)
  p.vals[i] <- t$p.value
}
head(sort(p.vals))
## Fusobacterium nucleatum subsp. animalis [ref_mOTU_v25_01001] 
##                                                 1.072932e-07 
##                  Dialister pneumosintes [ref_mOTU_v25_03630] 
##                                                 5.047528e-06 
##                    Hungatella hathewayi [ref_mOTU_v25_03436] 
##                                                 3.079069e-05 
##           Olsenella sp. Marseille-P2300 [ref_mOTU_v25_10001] 
##                                                 2.283988e-04 
##                         Clostridium sp. [ref_mOTU_v25_03680] 
##                                                 2.756465e-04 
##                   [Eubacterium] eligens [ref_mOTU_v25_02375] 
##                                                 3.395879e-04
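Note that we have just performed several hundred tests, so the raw p-values should be corrected for multiple testing before interpreting significance, for example with the Benjamini-Hochberg procedure via p.adjust (SIAMCAT performs such a correction automatically). A small illustration with made-up p-values:

```r
# made-up raw p-values for five hypothetical species
p.vals.toy <- c(sp1 = 1e-7, sp2 = 5e-6, sp3 = 3e-5, sp4 = 0.02, sp5 = 0.7)
# Benjamini-Hochberg correction controls the false discovery rate
p.vals.adj <- p.adjust(p.vals.toy, method = 'fdr')
sum(p.vals.adj < 0.05)  # 4 species remain significant after correction
```

In our case, `p.adjust(p.vals, method = 'fdr')` can be applied directly to the vector of Wilcoxon p-values computed above.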

The species with the most significant effect seems to be Fusobacterium nucleatum, so let us take a look at the distribution of this species:

species <- 'Fusobacterium nucleatum subsp. animalis [ref_mOTU_v25_01001]'
df.plot <- tibble(fuso=rel.tax.profiles.filt[species,],
                  label=df.meta$Group)
df.plot %>% 
  ggplot(aes(x=label, y=fuso)) + 
    geom_boxplot() + 
    xlab('') + 
    ylab('F. nucleatum rel. ab.')

Let us remember that log-scales are important when visualizing relative abundance data!

df.plot %>% 
  ggplot(aes(x=label, y=log10(fuso + 1e-05))) + 
    geom_boxplot(outlier.shape = NA) +
    geom_jitter(width = 0.08) + 
    xlab('') + 
    ylab('log10(F. nucleatum rel. ab.)')

3 SIAMCAT Association Testing

We can also use the SIAMCAT R package to test for differential abundance and produce standard visualizations.

library("SIAMCAT")

Within SIAMCAT, the data are stored in the SIAMCAT object which contains the feature matrix, the metadata, and information about the groups you want to compare.

sc.obj <- siamcat(feat=rel.tax.profiles, meta=df.meta, 
                  label='Group', case='CRC')
## + starting create.label
## Label used as case:
##    CRC
## Label used as control:
##    CTR
## + finished create.label.from.metadata in 0.026 s
## + starting validate.data
## +++ checking overlap between labels and features
## + Keeping labels of 114 sample(s).
## +++ checking sample number per class
## +++ checking overlap between samples and metadata
## + finished validate.data in 0.444 s

We can use SIAMCAT for feature filtering as well:

sc.obj <- filter.features(sc.obj, filter.method = 'abundance', cutoff = 1e-03)
## Features successfully filtered
sc.obj <- filter.features(sc.obj, filter.method = 'prevalence', 
                          cutoff = 0.05, feature.type = 'filtered')
## Features successfully filtered
sc.obj
## siamcat-class object
## label()                Label object:         61 CTR and 53 CRC samples
## filt_feat()            Filtered features:    839 features after abundance, prevalence filtering
## 
## contains phyloseq-class experiment-level object @phyloseq:
## phyloseq@otu_table()   OTU Table:            [ 14213 taxa and 114 samples ]
## phyloseq@sam_data()    Sample Data:          [ 114 samples by 13 sample variables ]

Now, we can test the filtered features for differential abundance with SIAMCAT:

sc.obj <- check.associations(sc.obj, detect.lim = 1e-05, 
                             fn.plot = './associations.pdf')

4 Exercises: Visualization

  • The association metrics computed by SIAMCAT are stored in the SIAMCAT object and can be extracted with associations(sc.obj) if you want to take a closer look at the results yourself. Plot a volcano plot of the associations between cancer and controls using the output from SIAMCAT.

  • Optional: Create an ordination plot for our data and colour the samples by group. How would you interpret the results? Try out different ecological distances. How does the choice of distance affect the group separation? (Tip: make sure to check out the vegdist function in the vegan package and also the pco function in the labdsv package.)

5 Machine learning with SIAMCAT

5.1 Machine learning workflow

The SIAMCAT machine learning workflow consists of several steps: feature normalization, splitting the data for cross validation, model training, making predictions, and model evaluation.

Since we already created a SIAMCAT object and filtered the raw data, we can start directly with the next step.

5.2 Normalization

SIAMCAT offers a few normalization approaches that can be useful for subsequent statistical modeling in the sense that they transform features in a way that can increase the accuracy of the resulting models. Importantly, these normalization techniques do not make use of any label information (patient status), and can thus be applied up front to the whole data set (and outside of the following cross validation).

sc.obj <- normalize.features(sc.obj, norm.method = 'log.std',
                             norm.param = list(log.n0=1e-05, sd.min.q=0))
## Features normalized successfully.
sc.obj
## siamcat-class object
## label()                Label object:         61 CTR and 53 CRC samples
## filt_feat()            Filtered features:    839 features after abundance, prevalence filtering
## associations()         Associations:         Results from association testing
##                                              with 11 significant features at alpha 0.05
## norm_feat()            Normalized features:  839 features normalized using log.std
## 
## contains phyloseq-class experiment-level object @phyloseq:
## phyloseq@otu_table()   OTU Table:            [ 14213 taxa and 114 samples ]
## phyloseq@sam_data()    Sample Data:          [ 114 samples by 13 sample variables ]
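Conceptually, the log.std method adds the pseudocount log.n0 to avoid taking the log of zero, log10-transforms the abundances, and then z-scores each feature. A minimal base-R sketch of this idea (the matrix is made up; SIAMCAT's actual implementation additionally handles details such as the sd.min.q quantile cutoff):

```r
# made-up relative abundances: 2 features (rows) x 4 samples (columns)
rel.ab <- matrix(c(0,    1e-4, 1e-2, 1e-3,
                   1e-5, 0,    1e-1, 1e-2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c('speciesA', 'speciesB'),
                                 paste0('sample', 1:4)))
log.n0 <- 1e-5                    # pseudocount for zero values
log.ab <- log10(rel.ab + log.n0)  # log10-transform
# standardize each feature (row) to zero mean and unit variance
norm.ab <- t(scale(t(log.ab)))
round(rowMeans(norm.ab), 10)      # each feature now has mean 0
```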

5.3 Cross validation split

Cross validation is a technique to assess how well an ML model would generalize to external data by partitioning the dataset into training and test sets. Here, we split the dataset into 10 parts, train a model on 9 of these parts, and use the left-out part to test the model, so that each part is left out once. The whole process is repeated 10 times (resampling).

sc.obj <- create.data.split(sc.obj, num.folds = 10, num.resample = 10)
## Features splitted for cross-validation successfully.
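The splitting logic can be sketched in base R as follows (a simplified illustration; SIAMCAT additionally stratifies the folds so that both classes are represented in each fold):

```r
set.seed(42)
n.samples    <- 114   # number of samples in our dataset
num.folds    <- 10
num.resample <- 10
# for each resampling round, randomly assign every sample to one of the folds
splits <- replicate(num.resample,
                    sample(rep(seq_len(num.folds), length.out = n.samples)))
dim(splits)         # 114 samples x 10 resampling rounds
table(splits[, 1])  # fold sizes are as equal as possible (11 or 12 samples)
```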

5.4 Model Training

Now, we can train a LASSO logistic regression classifier in order to distinguish CRC cases and controls.

sc.obj <- train.model(sc.obj, method='lasso')
## Trained lasso models successfully.

5.5 Predictions

The make.predictions function will automatically apply the models trained during cross validation to their respective test sets and aggregate the predictions across the whole dataset.

sc.obj <- make.predictions(sc.obj)
## Made predictions successfully.

5.6 Model Evaluation

Calling the evaluate.predictions function will result in an assessment of precision and recall as well as in a ROC analysis, both of which can be plotted as a PDF file using the model.evaluation.plot function (the name of/path to the PDF file is passed as an argument).

sc.obj <- evaluate.predictions(sc.obj)
## Evaluated predictions successfully.
model.evaluation.plot(sc.obj, fn.plot = './eval_plot.pdf')
## Plotted evaluation of predictions successfully to: ./eval_plot.pdf
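As an aside, the AUROC reported by evaluate.predictions has an intuitive interpretation: it is the probability that a randomly chosen case receives a higher prediction score than a randomly chosen control, and it can be computed from the rank-sum (Mann-Whitney U) statistic. A small base-R illustration with made-up labels and scores:

```r
# made-up example: 5 controls (0) and 5 cases (1) with prediction scores
labels <- c(rep(0, 5), rep(1, 5))
scores <- c(0.1, 0.2, 0.3, 0.35, 0.6,   # controls
            0.4, 0.5, 0.7, 0.8, 0.9)    # cases
n.pos <- sum(labels == 1)
n.neg <- sum(labels == 0)
r <- rank(scores)
# AUC = U / (n.pos * n.neg), with U derived from the rank sum of the cases
auc <- (sum(r[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
auc  # 0.92
```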

5.7 Model interpretation

Finally, the model.interpretation.plot function will plot characteristics of the models (i.e. model coefficients or feature importance) alongside the input data, aiding in understanding how and why the model works (or does not).

model.interpretation.plot(sc.obj, consens.thres = 0.7,
                          fn.plot = './interpretation_plot.pdf')
## Successfully plotted model interpretation plot to: ./interpretation_plot.pdf

6 Exercise: Prediction on external data

On the EMBL cluster, there is another dataset from a colorectal cancer metagenomic study. The study population was recruited in Austria, so you can find the data under:

fn.feat.at <- paste0(data.location, '/motus_profiles/AT-CRC.motus')
fn.meta.at <- paste0(data.location, '/metadata/meta_AT-CRC.tsv')

  • Apply the trained model to this dataset and check the model performance on the external data.

  • Train a SIAMCAT model on the Austrian dataset and apply it to the French dataset. How does the model transfer to the external dataset compare between the two directions? Also compare the feature weights obtained when training on the French versus the Austrian dataset.
    Note: You can supply several SIAMCAT objects to the model.evaluation.plot function to compare ROC curves in the same plot.

7 Exercise: Taxonomic vs functional predictors

In addition to the taxonomic profiles, we also created functional profiles for the French dataset. You can find them under:

fn.feat.fr.kegg <- paste0(data.location, 
                          '/functional_profiles/KEGG_kos_FR-CRC.tsv')

  • Explore the distribution of the functional data (abundance distribution, prevalence, etc.) and compare it to what you observe with the taxonomic profiles. Which filtering regime would make sense for functional data?

  • Use SIAMCAT to train a model based on the functional KEGG profiles and compare it to the one trained on taxonomic profiles.
    Note: Since the functional profiles contain many thousands of features, it makes sense to perform feature selection on your dataset. You can supply the parameters for the feature selection to the train.model function in SIAMCAT.

8 Further information

You can find more information about SIAMCAT at https://siamcat.embl.de or on Bioconductor at https://www.bioconductor.org/packages/release/bioc/html/SIAMCAT.html

There you can also find several vignettes which go into more detail about different applications for SIAMCAT.

9 SessionInfo

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] DiagrammeR_1.0.6.1 SIAMCAT_1.11.0     phyloseq_1.32.0    mlr_2.18.0        
##  [5] ParamHelpers_1.14  forcats_0.5.0      stringr_1.4.0      dplyr_1.0.2       
##  [9] purrr_0.3.4        readr_1.4.0        tidyr_1.1.2        tibble_3.0.4      
## [13] ggplot2_3.3.2      tidyverse_1.3.0   
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_1.4-1    ellipsis_0.3.1      XVector_0.28.0     
##   [4] fs_1.5.0            rstudioapi_0.11     farver_2.0.3       
##   [7] fansi_0.4.1         lubridate_1.7.9     xml2_1.3.2         
##  [10] codetools_0.2-18    splines_4.0.2       PRROC_1.3.1        
##  [13] knitr_1.30          ade4_1.7-16         jsonlite_1.7.1     
##  [16] pROC_1.16.2         broom_0.7.2         gridBase_0.4-7     
##  [19] cluster_2.1.0       dbplyr_2.0.0        compiler_4.0.2     
##  [22] httr_1.4.2          backports_1.2.0     assertthat_0.2.1   
##  [25] Matrix_1.2-18       cli_2.1.0           visNetwork_2.0.9   
##  [28] htmltools_0.5.0     prettyunits_1.1.1   tools_4.0.2        
##  [31] igraph_1.2.6        gtable_0.3.0        glue_1.4.2         
##  [34] reshape2_1.4.4      LiblineaR_2.10-8    fastmatch_1.1-0    
##  [37] Rcpp_1.0.5          parallelMap_1.5.0   Biobase_2.48.0     
##  [40] cellranger_1.1.0    vctrs_0.3.4         Biostrings_2.56.0  
##  [43] multtest_2.44.0     ape_5.4-1           nlme_3.1-150       
##  [46] iterators_1.0.13    xfun_0.19           rvest_0.3.6        
##  [49] lifecycle_0.2.0     XML_3.99-0.5        beanplot_1.2       
##  [52] zlibbioc_1.34.0     MASS_7.3-53         scales_1.1.1       
##  [55] hms_0.5.3           parallel_4.0.2      biomformat_1.16.0  
##  [58] rhdf5_2.32.4        RColorBrewer_1.1-2  BBmisc_1.11        
##  [61] curl_4.3            yaml_2.2.1          gridExtra_2.3      
##  [64] stringi_1.5.3       S4Vectors_0.26.1    corrplot_0.84      
##  [67] foreach_1.5.1       checkmate_2.0.0     permute_0.9-5      
##  [70] BiocGenerics_0.34.0 shape_1.4.5         rlang_0.4.8        
##  [73] pkgconfig_2.0.3     matrixStats_0.57.0  evaluate_0.14      
##  [76] lattice_0.20-41     Rhdf5lib_1.10.1     htmlwidgets_1.5.2  
##  [79] labeling_0.4.2      tidyselect_1.1.0    plyr_1.8.6         
##  [82] magrittr_1.5        R6_2.5.0            IRanges_2.22.2     
##  [85] generics_0.1.0      DBI_1.1.0           pillar_1.4.6       
##  [88] haven_2.3.1         withr_2.3.0         mgcv_1.8-33        
##  [91] survival_3.2-7      modelr_0.1.8        crayon_1.3.4       
##  [94] rmarkdown_2.5       progress_1.2.2      grid_4.0.2         
##  [97] readxl_1.3.1        data.table_1.13.2   vegan_2.5-6        
## [100] infotheo_1.2.0      reprex_0.3.0        digest_0.6.27      
## [103] stats4_4.0.2        munsell_0.5.0       glmnet_4.0-2